Main Questions

Question 1 & 2

Read the data as a pandas DataFrame. Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing CO2 emissions (metric tons per capita) and gdpPercap for the filtered data.

df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  geom_point(color = "red") +
  labs(y = "CO2 emissions (metric tons per capita)", x = "GDP in purchasing power parity (USD per capita)") +
  ggtitle("GDP vs. CO2 emissions in 1962")

df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  scale_y_log10() +
  scale_x_log10() +
  geom_point(color = "red") +
  ggtitle("log GDP vs. log CO2 emissions in 1962") +
  xlab("log GDP in purchasing power parity (USD per capita)") +
  ylab("log CO2 emissions (metric tons per capita)")

After visualizing the original data, we see a few large values that lie far from the bulk of the observations, which cluster close together at small values. It appears that as GDP per capita increases, CO2 emissions increase at a faster rate, up until GDP per capita reaches about 200,000. We cannot determine whether the relationship between the x and y values is linear just by visualizing them.

However, because both the x and y values span several orders of magnitude, we log-transform both x (GDP) and y (CO2).
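
One way to see why the log-log scale can help is to assume, purely for illustration, that emissions follow a power law in GDP. The constants \(a\) and \(b\) below are hypothetical and are not estimated anywhere in this report:

```latex
\[
y = a\,x^{b}
\quad\Longrightarrow\quad
\log_{10} y = \log_{10} a + b\,\log_{10} x
\]
```

Under that assumption the curved relationship becomes a straight line with slope \(b\) on log-log axes, which is the kind of association Pearson's correlation is designed to measure.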

Question 3

On the filtered data, calculate the pearson correlation of CO2 emissions (metric tons per capita) and gdpPercap. What is the Pearson R value and associated p value?

# Use the same base for both logs (Pearson's r does not depend on the base).
df <- df %>%
  mutate(logCO2 = log10(co2PerCap), logGDP = log10(gdpPercap))

mod <- cor.test(x = df$logCO2, y = df$logGDP) %>% tidy()
mod %>%
  kbl() %>%
  kable_styling()
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative
0.9018728 | 71.77423 | 0 | 1182 | 0.8906647 | 0.9119854 | Pearson’s product-moment correlation | two.sided

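Pearson’s correlation coefficient indicates the strength and direction of the linear relationship between the two variables. logGDP is strongly and positively associated with logCO2 (r = 0.90), and the p-value is too small to distinguish from 0. Note that the test above uses every row of df (1182 degrees of freedom, i.e. 1,184 observations), not only the rows for 1962.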
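
The following is a minimal sketch of the same test restricted to a single year, as the question asks. It runs on simulated data: the data frame `toy` and its values are hypothetical, and only the column names follow the report. It also checks that Pearson's r does not depend on the base of the logarithm.

```r
# Hypothetical toy data standing in for the real data frame `df`;
# only the column names (Year, co2PerCap, gdpPercap) follow the report.
set.seed(1)
toy <- data.frame(
  Year = rep(c(1962, 1967), each = 50),
  gdpPercap = exp(rnorm(100, mean = 8, sd = 1))
)
toy$co2PerCap <- 0.001 * toy$gdpPercap^0.9 * exp(rnorm(100, sd = 0.3))

# Restrict to 1962 before testing.
toy_1962 <- subset(toy, Year == 1962)

res_log10 <- cor.test(log10(toy_1962$co2PerCap), log10(toy_1962$gdpPercap))
res_ln <- cor.test(log(toy_1962$co2PerCap), log(toy_1962$gdpPercap))

# Changing the log base rescales each variable by a positive constant,
# so the correlation coefficient is identical.
same_r <- isTRUE(all.equal(unname(res_log10$estimate), unname(res_ln$estimate)))

print(res_log10$estimate)
print(res_log10$parameter) # degrees of freedom = n - 2 = 48
print(same_r)
```

The degrees of freedom are a quick check on how many rows entered the test: with 50 rows for 1962, the test reports 48.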

Question 4

In what year is the correlation between CO2 emissions (metric tons per capita) and gdpPercap the strongest?

res <- df %>%
  group_by(Year) %>%
  summarise(
    tidy(
      cor.test(x = co2PerCap, y = gdpPercap, method = "kendall")
    )
  ) %>%
  dplyr::slice_max(estimate, n = 1)


res %>%
  kbl() %>%
  kable_styling()
Year | estimate | statistic | p.value | method | alternative
2002 | 0.780129 | 12.90234 | 0 | Kendall’s rank correlation tau | two.sided

Kendall’s tau was used because the two variables are not normally distributed, as the plots in Question 1 show. Kendall’s tau between CO2 emissions and GDP per capita is highest in 2002, at tau = 0.78.
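
A short sketch on simulated data of why a rank-based measure suits skewed variables. The vectors `x` and `y` are hypothetical stand-ins for GDP and CO2. Kendall's tau depends only on the ordering of the values, so it is the same on the raw and the log scale, whereas Pearson's r generally changes.

```r
# Simulated, right-skewed data (hypothetical; not the report's data).
set.seed(2)
x <- exp(rnorm(60, mean = 8))          # GDP-like values
y <- x^0.8 * exp(rnorm(60, sd = 0.5))  # CO2-like values

# Kendall's tau on the raw and on the log10 scale.
tau_raw <- cor(x, y, method = "kendall")
tau_log <- cor(log10(x), log10(y), method = "kendall")

# Pearson's r on the raw and on the log10 scale, for comparison.
r_raw <- cor(x, y)
r_log <- cor(log10(x), log10(y))

# A monotone transformation preserves ranks, so tau is unchanged.
tau_unchanged <- isTRUE(all.equal(tau_raw, tau_log))

print(c(tau_raw = tau_raw, tau_log = tau_log))
print(c(r_raw = r_raw, r_log = r_log))
print(tau_unchanged)
```

This is why the Kendall test can be run on the untransformed co2PerCap and gdpPercap columns without affecting which year ranks highest.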

Question 5

Using plotly or bokeh, create an interactive scatter plot comparing CO2 emissions (metric tons per capita) and gdpPercap.

fig <- df %>%
  filter(Year == res$Year) %>%
  plot_ly(
    x = ~logGDP,
    y = ~logCO2,
    size = ~pop,
    color = ~continent,
    # frame = ~Year,
    text = ~`Country Name`,
    hoverinfo = "text",
    type = "scatter",
    mode = "markers"
  )

# logGDP and logCO2 are already log-transformed, so both axes stay linear;
# the axis titles match the mapped variables (x = logGDP, y = logCO2).
fig %>%
  layout(
    title = paste0("log GDP vs. log CO2 emissions in ", res$Year),
    plot_bgcolor = "#e5ecf6",
    xaxis = list(title = "log GDP"),
    yaxis = list(title = "log CO2 Emissions"),
    legend = list(title = list(text = "<b> Continent </b>"))
  )

The interactive plot above depicts the relationship between CO2 emissions and GDP per capita in 2002, the year in which the correlation between the two variables is highest, as shown in the previous question. Hovering over a dot displays the country name, and the dot size corresponds to the population of that country.

More Questions

Question 1

What is the relationship between continent and Energy use (kg of oil equivalent per capita)?

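# Use the formula interface: when a data frame is piped in as the first
# positional argument, kruskal.test() treats every column as a separate sample.
res <- df %>%
  filter(!is.na(continent)) %>%
  kruskal.test(energyUsePerCap ~ continent, data = .) %>%
  tidy()

res %>%
  kbl() %>%
  kable_styling()
statistic | p.value | parameter | method
12963.99 | 0 | 21 | Kruskal-Wallis rank sum test

We use the Kruskal-Wallis test because it is a non-parametric alternative to one-way ANOVA. It does not assume normally distributed residuals. The test works on 2 or more independent samples, which may have different sizes.

The reported p-value is effectively 0, well below the significance threshold of 0.05, which points to a significant relationship between continent and energy use. One caveat: the 21 degrees of freedom in the table imply 22 groups, far more than the number of continents. That is what kruskal.test() returns when a whole data frame is passed as its first argument, because each column is then treated as a sample, so the statistic should be recomputed with the formula call before the conclusion is relied on.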
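
A minimal sketch, on simulated data, of a Kruskal-Wallis test called through the formula interface (response ~ group). The data frame `toy` and its values are hypothetical, and only the column names follow the report. With k groups the test has k - 1 degrees of freedom, which is a useful sanity check on the output.

```r
# Simulated energy use for three continents (hypothetical values).
set.seed(3)
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), times = c(20, 25, 30)),
  energyUsePerCap = c(rlnorm(20, 6), rlnorm(25, 7), rlnorm(30, 8))
)

# Formula interface: numeric response on the left, grouping variable on the right.
kw <- kruskal.test(energyUsePerCap ~ continent, data = toy)

print(kw$parameter) # df = number of groups - 1 (here 2)
print(kw$p.value)
```

Group sizes differ here on purpose, since the test does not require balanced samples.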

Question 2

Is there a significant difference between Europe and Asia with respect to Imports of goods and services (% of GDP) in the years after 1990?

mod <- df %>%
  filter(continent %in% c("Asia", "Europe"), Year > 1990) %>%
  glm(importPercentageGDP ~ continent, data = .) %>%
  tidy()

mod %>%
  kbl() %>%
  kable_styling()
term estimate std.error statistic p.value
(Intercept) 46.845311 2.613728 17.922793 0.0000000
continentEurope -5.056071 3.564314 -1.418526 0.1575197

While there are many candidate statistical tests for comparing a variable of interest between two groups, we chose a simple linear regression because it also tells us to what extent the regressor (continent) affects the regressand (imports of goods and services as a % of GDP).

\[ Y_i = \beta_0 + \beta_1 \, \text{continent}_i + \epsilon_i \]

The null hypothesis is \(\beta_{1} = 0\), where the variable continent equals 1 for Europe and 0 for Asia.

We fit a linear regression model to compare the two groups. There is no significant difference between Europe and Asia in imports of goods and services as a percentage of GDP (p = 0.16, above the 0.05 threshold).

A t-test would also have answered the question above; linear regression has the additional advantage of quantifying how much a change from Asia (0) to Europe (1) shifts the outcome variable (imports of goods and services as a % of GDP). This is given by the beta weight, -5.06: Europe's mean import share is about 5.06 percentage points lower than Asia's, a difference that is not statistically significant.
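
The link between the two approaches can be sketched on simulated data. The data frame `toy` and its values are hypothetical, and only the column names follow the report. For a two-level group variable, the slope of the regression equals the difference in group means, and its p-value matches a two-sample t-test that assumes equal variances.

```r
# Simulated import shares for two continents (hypothetical values).
set.seed(4)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), times = c(40, 35)),
  importPercentageGDP = c(rnorm(40, 47, 20), rnorm(35, 42, 20))
)

# Linear regression with Asia as the reference level (alphabetical order).
fit <- lm(importPercentageGDP ~ continent, data = toy)
slope <- coef(summary(fit))["continentEurope", ]

# Pooled-variance two-sample t-test on the same data.
tt <- t.test(importPercentageGDP ~ continent, data = toy, var.equal = TRUE)

# Group means, to compare against the slope.
means <- tapply(toy$importPercentageGDP, toy$continent, mean)

same_p <- isTRUE(all.equal(unname(slope["Pr(>|t|)"]), tt$p.value))
same_diff <- isTRUE(all.equal(
  unname(slope["Estimate"]),
  unname(means["Europe"] - means["Asia"])
))

print(same_p)
print(same_diff)
```

Welch's t-test, the default in `t.test()`, drops the equal-variance assumption and would generally give a slightly different p-value.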

Question 3

What is the country (or countries) that has the highest Population density (people per sq. km of land area) across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  dplyr::slice_max(popDensityPerSqKm, n = 3) %>%
  ggplot(data = ., aes(x = as.factor(Year), y = popDensityPerSqKm, fill = as.factor(`Country Name`))) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_classic() +
  labs(x = "Year", y = "population density (per sq.km)", fill = "Country") +
  ggtitle("Population density in the 3 highest-density countries per year, 1962-2007")

res <- df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  mutate(
    rnks = row_number(desc(popDensityPerSqKm))
  ) %>%
  group_by(`Country Name`) %>%
  summarize(mean.rank = mean(rnks)) %>%
  slice_min(mean.rank, n = 3)

country1 <- res$`Country Name`[1]
country2 <- res$`Country Name`[2]

res %>%
  kbl() %>%
  kable_styling()
Country Name | mean.rank
Macao SAR, China | 1.5
Monaco | 1.5
Hong Kong SAR, China | 3.1

The highest-ranked country in terms of population density changes across the years, as we can tell from the graph above.

To find out which country has the highest average ranking, we rank the countries by population density within each year and then average each country's ranks across the years. Macao SAR, China and Monaco are tied for first place because their average rank over the period 1962-2007 is the same, 1.5.
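
The rank-then-average procedure can be sketched in base R on a tiny made-up table. The countries "A", "B", "C" and their densities are hypothetical. Two countries that swap first and second place between years end up with the same mean rank, which is how a tie like the one above can arise.

```r
# Hypothetical densities for three countries in two years.
toy <- data.frame(
  Year = rep(c(1962, 1967), each = 3),
  country = rep(c("A", "B", "C"), times = 2),
  popDensityPerSqKm = c(900, 800, 100, 700, 950, 120)
)

# Rank within each year; negating the density makes rank 1 the densest country.
toy$rnk <- ave(
  -toy$popDensityPerSqKm, toy$Year,
  FUN = function(v) rank(v, ties.method = "min")
)

# Average each country's rank across years and sort from best to worst.
mean_rank <- aggregate(rnk ~ country, data = toy, FUN = mean)
mean_rank <- mean_rank[order(mean_rank$rnk), ]

print(mean_rank) # A and B both average 1.5; C averages 3
```

This sketch uses `rank()` with `ties.method = "min"`, which gives tied densities the same rank; `row_number()` in the report's code instead breaks such ties by row order.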

Question 4

What country (or countries) has shown the greatest increase in Life expectancy at birth, total (years) since 1962?

res <- df %>%
  select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
  group_by(`Country Name`) %>%
  summarise(
    diff = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962],
    .groups = "drop"
  ) %>%
  dplyr::slice_max(diff, n = 5)

res %>%
  kbl() %>%
  kable_styling()
Country Name diff
Maldives 36.91615
Bhutan 33.19895
Timor-Leste 31.08515
Tunisia 30.86076
Oman 30.82310
res %>%
  ggplot(aes(x = reorder(`Country Name`, -diff), y = diff)) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightblue") +
  theme_classic() +
  ggtitle("Increase in Life Expectancy in Years (Period: 1962-2007)") +
  ylab("Years") +
  xlab("Country") +
  geom_text(aes(label = round(diff, 2)), position = position_dodge(width = 0.9), vjust = -0.25)

From the graph above, we see that the top 5 countries that have shown the greatest increase in life expectancy are Maldives, Bhutan, Timor-Leste, Tunisia, and Oman.

This answer is based on the absolute difference in life expectancy between 1962 and 2007.